A Terabit Multiservice Switch



The Cyclone switch architecture enables a scalable switching
platform from multiple Gbits to multiple Tbits per second in five
custom 0.18-micron CMOS ICs. A wire-speed scheduling capability
supports eight quality of service classes and a million flows.
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net, network switches must be both {aster and 
smarter. They must not simply terminate
high-speed optical connections (OC-192 now
and OC-768 in the near future). They also
must switch a large number of connections
from dense wave division multiplexing
(DWDM) transport systems. They must provide
guarantees on parameters such as bandwidth,
latency, loss rate, and jitter that aren't
supported by current best-effort switch
architectures. Finally, they must provide a path to
migrate to predominantly Internet protocol/
multiprotocol label switching (IP/MPLS)-based
networks while honoring existing
investments in the legacy network services
such as time division multiplexing (TDM)
and frame relay.

This article describes a terabit multiservice
switch architecture designed to solve these
problems. Because of its multiterabit switching
speed, quality of service (QoS) support
capability, multiprotocol capability including
TDM, and scalability, the Cyclone switch
architecture is directly applicable in the
following areas and many more:

• multiterabit switching at the Internet
  core: terabit routers and carrier-class
  ATM or MPLS switches,

• aggregation for all optical networks,

• unified packet/circuit switching platforms, and

• high-end enterprise applications.

Figure 1 shows a typical application of the
Cyclone switch architecture. In this configuration,
the switch core is physically separated
from the rest of the switch and router. (This
architecture also supports systems with integrated
switch core and line termination cards.)

Cyclone switches use vertical cavity surface
emitting laser, or VCSEL, arrays optimized
for short-reach optical connections in this
configuration. The current VCSEL technol- 
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containing PHY, 
network processors. In fact, 
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Switch architecture oven/iew 

The Cyclone switch architecture is optimized
for 32-port input and output buffered
switches. Each port supports up to an 80-Gbps
link bandwidth. Therefore, the aggregate
bandwidth of a Cyclone switch is 5
terabits per second (2.5-Tbps ingress plus
2.5-Tbps egress). Each port consists of up to four
channels, each of which supports up to 20
Gbps. Figure 2 shows a conceptual block diagram
of the Cyclone switch architecture.
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To prevent head of line (HOL) blocking 
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so that multicast packets incur no perfor- 
mance degradation. 

The Cyclone switch architecture supports
QoS at wire speed. In other words, Cyclone
switches can provide advanced QoS features
such as fair bandwidth allocation and low delay
jitter without reducing the switching speed.
The Cyclone architecture accomplishes this by
providing three levels of scheduling.



1. At the input side, the 
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•forms pri- 
ing up to eight classes of service for each
output channel. To select candidate packets
for transmission (one for each destination),
the switch uses either the deficit
round-robin (DRR) algorithm¹ or the
weighted round-robin (WRR) algorithm.

2. In its center, the switch uses a parallel
arbitration algorithm for the maximal
matching of packets at the heads of virtual
output queues and their destination
output channels.

3. At the output side, the switch implements
either a programmable weighted
fair queuing (WFQ)² or DRR algorithm
for fair allocation of bandwidth.

In addition, the Cyclone architecture supports
the provisioning of TDM service with an
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absolute guarantee oi reserved bandwidth an; 
zero delay jitter. Thus a Cyclone switch can
be configured (a fraction of bandwidth for TDM
and the remainder for packet switching) as a
true multiservice switch.

Finally, the Cyclone switch architecture is
cleanly scalable from 2.5 Tbps down to 40 Gbps
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( Vcione switches assume chat packets an: seg- 
mented into 64- or 80-byte fixed-size cells
before entering the ingress of the switch. Each
unicast cell contains a 6-byte header, including
a special head of the cell indicator as the first
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Figure 2. A conceptual block diagram of a 32 x 32 Cyclone switch architecture. VOC 
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byte (typically, a comma character if the incom- 
ing data are 8B10B-encoded), 56- or 72-byte
payload, and a 2-byte vertical parity (or CRC)
field. Thus the total cell overhead is 8 bytes.
Each multicast cell contains a 9-byte header, 53-
or 69-byte payload, and a 2-byte parity field.
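
The cell layout described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the Cyclone design itself; the names `HEADER_BYTES`, `payload_bytes`, and `cell_efficiency` are hypothetical and introduced only for illustration.

```python
# Cell layout from the text: header + payload + 2-byte parity (or CRC).
# Header sizes: 6 bytes for unicast cells, 9 bytes for multicast cells.
HEADER_BYTES = {"unicast": 6, "multicast": 9}
PARITY_BYTES = 2


def payload_bytes(cell_size, kind="unicast"):
    """Payload left in a cell of `cell_size` bytes after header and parity."""
    return cell_size - HEADER_BYTES[kind] - PARITY_BYTES


def cell_efficiency(cell_size, kind="unicast"):
    """Fraction of each cell that carries payload (1 minus the cell tax)."""
    return payload_bytes(cell_size, kind) / cell_size


for size in (64, 80):
    for kind in ("unicast", "multicast"):
        print(size, kind, payload_bytes(size, kind),
              round(cell_efficiency(size, kind), 3))
```

The payload sizes reproduce the 56/72-byte (unicast) and 53/69-byte (multicast) figures quoted in the text.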



Although each channel of Cyclone switches
can handle a 20-Gbps raw data rate, the
actual payload throughput is lower due to the
cell tax and the round-off overhead introduced
by the cellularization of packets.

For example, each Sonet/SDH (synchronous
digital hierarchy; the international version
of Sonet) frame for STS-N (OC-N)
consists of N × 90 columns × 9 rows = 810N
bytes, of which 3 columns are transport overhead
bytes. A frame is transported every 125
microseconds. Therefore, 9 × (90 - 3) × N
bytes (for example, 150,336 bytes for OC-192)
are transported every 125 microseconds.

Furthermore, for packets transmitted over
Sonet (IP over the point-to-point protocol, or
PPP, over Sonet), each IP packet is associated
with a 9-byte PPP overhead. Hence, the
required IP packet throughput is 783N/(p + 9)
packets (or 783Np/(p + 9) bytes)
per 125 microseconds. Here, p is the average
packet size in bytes.

Assuming that all packets are unicast and
the cell size is 64 bytes (56-byte payload plus
8-byte overhead), a channel of a Cyclone
switch connected to line cards that terminate
Sonet OC-N carrying IP packets over PPP
must be able to transport 783N ×
[64⌈p/56⌉]/(p + 9) bytes per 125 microseconds.
This means that the switch must be sped
up by a factor of 87/90 × [64⌈p/56⌉]/(p + 9).
For example, if p = 40, the speedup factor
must be greater than 1.26.
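
The Sonet payload and speedup arithmetic above can be reproduced numerically. This is a minimal sketch under the assumptions stated in the text (unicast cells, 9-byte PPP overhead per IP packet, every packet the same size p); the function names are hypothetical.

```python
import math


def sonet_payload_bytes(n):
    """Bytes carried per 125-microsecond STS-N frame after transport overhead."""
    return 9 * (90 - 3) * n


def speedup_factor(p, cell_size=64, cell_overhead=8, ppp_overhead=9):
    """Required channel speedup for IP packets of p bytes carried over PPP/Sonet.

    Each packet occupies ceil(p / cell payload) whole cells inside the switch,
    while on the line it occupies p + ppp_overhead bytes.
    """
    cells = math.ceil(p / (cell_size - cell_overhead))
    return (87 / 90) * (cell_size * cells) / (p + ppp_overhead)


print(sonet_payload_bytes(192))        # bytes per frame for OC-192
print(round(speedup_factor(40), 4))    # speedup needed for 40-byte packets
```

Evaluating `speedup_factor` for p = 56 and p = 57 shows the round-off effect: a packet one byte larger than the cell payload needs a second, mostly empty cell.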

Figure 3 shows the throughput that a
Cyclone switch can sustain when each channel
is connected to a Sonet OC-192 input
stream carrying IP packets, or an OC-192
plus an OC-48, for a given IP packet payload
size (assuming that every packet is of the same
size). Clearly, for certain packet sizes, it's not
possible to sustain 100% throughput. Furthermore,
[...] more difficult it is to sustain the 100%
[...] the distribution of packet payload size dictates
that the speedup factor should be optimized
[...] packets (TCP acknowledgments) [...]

A fully configured Cyclone switch contains
32 input/output port cards. Each port, as shown
in Figure 5, can transmit and receive 8B10B-
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encoded data at up to 100 Gbps (80-Gbps 
decoded data). Each port is organized as four 
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receive 8B iOB-encoded data at up to 25 Gbps. 

Each channel consists of up to 8 to 10 physical
serial links, and each link
transports data at 2.5 Gbps. Cyclone switches
configured with optical links use VCSEL
arrays for transmitting and receiving.



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The Cyclone switch can he configured as .1 
unified packet/circuit-switching platform. 
Figures 6 and 7 illustrate the packet and TDM 
flows. 

Packets entering the switch are stored in the
input buffer and their pointers and lengths in
the appropriate virtual output queues (VOQs),
according to their destinations [...]
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and their 


[...] the output buffer [...] appropriate output queues
(labeled as EDFQ in Figure 6), where they are sorted,
based on their priority levels,
and selected for transmission.

TDM frames must be cellularized
in the Cyclone cell
format before entering the
switch. Upon entrance, the [...]

Figure 5. Port organization.
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Figuie 6. Packer flow assuming four channels psr port. FDPQ == earliest deadline first queu€ 



packet traffic). TDM frames
are switched at the crossbars,
as in the case of packets; however,
[...] stored in the output buffer and
their pointers in the TDM
FIFO queue (segregated from

Figure 7. TDM flow. COSQ = class-of-service queue; FIFOQ = first-in, first-out queue.
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Cell clock = 125 MHz 



they arrive. However, packets
arriving on channel i are independent
of packets arriving
on channel j. In other words,
the FIFO ordering of packets
isn't maintained across the
channels.

Arriving cells are parallelized
in the manner depicted
in Figure 8 before they are
stored in the memory buffer
one cell at a time every 8 ns.
Cell parallelization is staggered
by 8 ns so that the
memory controller and the
memory buffer process only
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4 channels per port. 
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Memory controller 




Since a cell time (the
amount of time it takes to
transport a cell) on a 2-Gbps
link is 320 ns for cell sizes of 80
bytes, the maximum number
of 2-Gbps links per port (for a switch configured
for an 80-byte cell size) is 40. If a port comprises
more than one channel, the servicing of
the channels is interleaved. An example is link
0 of channel 0 followed by link 0 of channels 1,
2, and 3 followed by link 1 of channels 0, 1, 2,
and 3. This scheme has the effect of maximizing
the distance between two adjacent links in
the same channel so that the switch can better
tolerate the arrival time jitter between two
consecutive cells arriving on the same channel.
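
The link count and the interleaved servicing order described above can be sketched as follows. This is an illustration only: it assumes one cell is written every 8 ns (as stated in the text) and a decoded link rate of 2 Gbps; the function names and the 10-links-per-channel default are hypothetical choices for the example.

```python
def cell_time_ns(cell_bytes, link_gbps=2.0):
    """Time to transport one cell on one link (bits / Gbit/s = ns)."""
    return cell_bytes * 8 / link_gbps


def max_links_per_port(cell_bytes, link_gbps=2.0, slot_ns=8):
    """Links that can be serviced once each within a single cell time."""
    return int(cell_time_ns(cell_bytes, link_gbps) // slot_ns)


def service_order(channels=4, links_per_channel=10):
    """Interleaved order: link 0 of every channel, then link 1, and so on.

    Returns (channel, link) pairs in the order they are serviced.
    """
    return [(ch, link)
            for link in range(links_per_channel)
            for ch in range(channels)]


print(cell_time_ns(80), max_links_per_port(80))
print(service_order()[:8])
```

With four channels, two adjacent links of the same channel are four 8-ns slots apart in this ordering, which is the spacing the interleaving is meant to maximize.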



Figure 9. Memory controller.



Memory controller. The memory controller,
shown in Figure 9, manages the memory
buffer as well as the linked lists for packets and
free cell slots.



the packet queues). They exit on the output
links designated for TDM traffic.



The ingress side of a Cyclone switch consists
of a memory subsystem (which includes
both the memory buffer and the controller)
and virtual output queues.

Each packet in its entirety is carried on a
single link; packets aren't striped across
multiple links. Packets arriving at each port
are assumed to be grouped by channels.



Cyclone switches [...] that packets
belonging to the [...] (and arriving on
the same channel) are transmitted in the order

• When a cell arrives, the memory controller
  assigns a cell pointer. If the cell is
  the head of a packet, it also assigns a
  packet ID, which is sent to the queue [...].
  The cell is then stored in the memory buffer.

• When a packet [...]
  packet ID. Packets are read out of the
  memory buffer one cell at a time, and cell
  pointers are recycled.

The memory controller also assists in the
multicast management. When a multicast
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Link 0.0.0 ^£ 





packet arrives, it's stored in the
input memory buffer and its
ID is copied to virtual output
queues [...] not simultaneously [...]
its packet ID is copied to the
VOQs [...]
each destination, according to
each VOQ's scheduling criterion.
The packet remains in
the input buffer until every
intended destination receives
a copy of the packet. The
memory controller maintains
the reference count of each
multicast packet. Each time a
copy is sent,
the count is decremented.
When the reference count of
the packet reaches zero, its cell
pointers are recycled: The packet is removed
from the input memory buffer.
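
The reference-counting scheme for multicast packets can be sketched in software as follows. This is a simplified model, not the hardware design: the class `MulticastBuffer` and its methods are hypothetical, a Python dictionary stands in for the cell-pointer linked lists, and one stored payload stands in for all of a packet's cells.

```python
class MulticastBuffer:
    """Toy input buffer that frees a packet once every destination has a copy."""

    def __init__(self):
        self._packets = {}   # packet ID -> stored payload
        self._refs = {}      # packet ID -> copies still owed

    def store(self, packet_id, payload, destinations):
        """Store one copy and return the (destination, packet ID) VOQ entries."""
        self._packets[packet_id] = payload
        self._refs[packet_id] = len(destinations)
        return [(dest, packet_id) for dest in destinations]

    def send_copy(self, packet_id):
        """Read the packet for one destination and decrement its count."""
        payload = self._packets[packet_id]
        self._refs[packet_id] -= 1
        if self._refs[packet_id] == 0:
            # Last copy delivered: recycle the storage.
            del self._packets[packet_id]
            del self._refs[packet_id]
        return payload

    def holds(self, packet_id):
        return packet_id in self._packets


buf = MulticastBuffer()
entries = buf.store(7, b"cells", destinations=[1, 5, 9])
for _dest, pid in entries:
    buf.send_copy(pid)
print(buf.holds(7))
```

Only the packet ID is replicated into the per-destination queues; the payload is stored once, which is why the buffer must track when the last copy has left.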



Backplane speedup. A blocking problem
occurs when packets from multiple input ports
destined for the same output port simultaneously
appear at the heads of the corresponding
input queues. If the output port can only
receive one packet at a time, all but one packet
are blocked. To minimize this problem,
designers can configure the backplane bandwidth
as twice the input bandwidth. That is,
the Cyclone switch allows the backplane to be
sped up by a factor of two. This speedup is
accomplished by increasing the number of
backplane links and by allowing up to twice as
many cells to be read out of the memory buffer
as written in. As shown in Figure 10, the backplane
links are serviced in the same staggered
fashion as the input links: links 0.0.0 and 1.0.0
followed by links 0.1.0 and 1.1.0, and so on.

Cyclone switches with four channels per
port use two input memory banks per port to
facilitate backplane speedup. (An important
design criterion was to avoid using an expensive
memory design.) Each memory bank is
implemented as a dual-port memory with an
[...]. Packets from input channels
0 and 2 are stored in bank 0, and packets from
input channels 1 and 3 in bank 1. Both mem-



Figure 10. Backplane link timing. Link x.y.z means link z associated
with input buffer x and output buffer y.



Virtual output queues. Cyclone switches use
VOQs to prevent head-of-line blocking problems.
A separate VOQ is assigned to each output
channel. Systems with 32 ports and 4
channels per port require 128 logical VOQs.
Each VOQ entry consists of a pair (packet ID,
packet length).

Recall that to support backplane speedup,
two memory reads are required each cycle. To
guarantee that the packets selected for transmission
are from different banks, the Cyclone
switch uses two sets of VOQs. One set handles
packets stored in bank 0 (packets from input
channels 0 and 2) and the other handles packets
stored in bank 1 (packets from input channels
1 and 3).

Furthermore, each VOQ is organized as up
to 8 class of service queues (COSQs) for service
differentiation. Thus the maximum number
of COSQs is 2,048 (2 input memory
banks × 32 output ports × 4 channels per port
× 8 COS levels).
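
The queue counts above follow directly from the four dimensions named in the text. The sketch below shows one way a flat queue index could be derived from them; the indexing scheme and the name `cosq_index` are assumptions for illustration, not the layout used in the Cyclone ICs.

```python
BANKS, PORTS, CHANNELS, COS_LEVELS = 2, 32, 4, 8

LOGICAL_VOQS = PORTS * CHANNELS                      # one VOQ per output channel
TOTAL_COSQS = BANKS * PORTS * CHANNELS * COS_LEVELS  # all class-of-service queues


def cosq_index(bank, port, channel, cos):
    """Map (input bank, output port, output channel, COS level) to a flat index."""
    assert 0 <= bank < BANKS and 0 <= port < PORTS
    assert 0 <= channel < CHANNELS and 0 <= cos < COS_LEVELS
    return ((bank * PORTS + port) * CHANNELS + channel) * COS_LEVELS + cos


print(LOGICAL_VOQS, TOTAL_COSQS)
print(cosq_index(1, 31, 3, 7))   # last queue in this layout
```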



V banks a 



Half of the backplane links are assigned to bank 0 and
the other half to bank 1.



Input scheduler. The input scheduler determines a
candidate packet for transmission using one
of two algorithms: DRR for variable-length
packet traffic and weighted round robin
(WRR)⁵ for fixed-length cell traffic. These
algorithms select a candidate packet for each
VOQ, from one of the (up to) eight COS
queues that comprise a VOQ. Both DRR and



WRR are system-c 
modified to have ut 



tbie 



be 

priorities. 



the switching plane associ 
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Figure 11. Backplane configuration. 
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;>!sR!sTsci.hcd k 



The DRR algorithm [...]
packet based on [...]
unused share is added to the credit for the next
turn. For example, if the nominal credit is [...]

and the two packets at the
head of the queue contain 6 and
3 cells, only the first packet is
served and the credit for the
next turn becomes 10. If the
queue is empty, its credit
remains zero until it is backlogged.
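
One turn of deficit round robin for a single queue can be sketched as follows. The nominal credit value is not legible in this copy of the article; the sketch assumes a value of 8 cells, chosen because it is the value that reproduces the worked example (packets of 6 and 3 cells, next-turn credit of 10). The function name `drr_turn` is hypothetical.

```python
from collections import deque


def drr_turn(queue, deficit, quantum):
    """Serve one DRR turn for one queue.

    queue   -- deque of packet lengths in cells, head at the left
    deficit -- unused credit carried over from the previous turn
    quantum -- nominal credit added each turn (assumed, see lead-in)
    Returns (lengths served this turn, deficit carried to the next turn).
    """
    if not queue:
        return [], 0                 # an empty queue accumulates no credit
    credit = deficit + quantum
    served = []
    while queue and queue[0] <= credit:
        length = queue.popleft()
        credit -= length
        served.append(length)
    return served, (credit if queue else 0)


q = deque([6, 3])
served, deficit = drr_turn(q, deficit=0, quantum=8)
print(served, deficit, deficit + 8)   # next turn starts with deficit + quantum
```

In the second turn the queue starts with 10 cells of credit, which is enough to send the 3-cell packet.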

\v\ v : 

With a speedup of two, the
number of links from each
input to the backplane is 80 for
a cell size of 80 bytes and 64 for
a 64-byte cell size, as depicted
in Figure 11. Likewise, the
number of links from the
backplane to each output is 80
(64). As in the case of packet
transmission on input links,
each packet is carried on a link
in its entirety; packets aren't
striped across multiple links.
The placement of cells on two
adjacent pairs of links is staggered
by 8 ns, as shown in Figure 10:
Cells are placed on a
pair of links every 8 ns. Likewise,
the arrival of cells on two
adjacent pairs of links from the
backplane to an output port is
staggered by 8 ns. Therefore, it
takes 320 ns to service all 80
links (256 ns for 64 links),
which is equal to one cell delay
on a 2-Gbps link.

Arbitration. As shown
in Figure 12, two logical
switching planes are used in
Cyclone switches with four
channels per port. One switching
plane connects to one
input memory bank and the other plane to the
other input memory bank. Each switching
plane allows up to 2× backplane speedup. Outgoing
packets from each switching plane are
sent to two output memory banks.

Each logical switching plane comprises
multiple (up to eight) physical switching
planes, each of which is controlled by an
arbiter, as shown in Figure 13. Physical
switching planes operate concurrently (with
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S- lis sraggciirig between adjacent planes) Ki 
allow multiple concurrent packet transmissions
from one input port to another. Note that
no packet-ordering problem exists, even
though each arbiter makes independent arbitration
decisions. This is because each COSQ
maintains a FIFO ordering of packets.

Each of the 80 sets of links (for switches with
a cell size set to 80 bytes and a full 2× backplane
speedup) is mapped to a crossbar. Up to five
crossbars are assigned to each physical switching
plane. An arbiter assigned to a physical switching
plane makes arbitration (scheduling) decisions
for all the crossbars in the plane. All the
crossbars must be reconfigured in 320 ns. Since
each arbiter is responsible for reconfiguring 5
crossbars in 320 ns sequentially (one crossbar
at a time), arbitration decisions, each of which
applies to a different crossbar, are made every
64 ns. Note that a new reconfiguration decision
for each crossbar is made every 320 ns.
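
The arbitration timing above follows from the cell time and the number of crossbars each arbiter serves. A minimal sketch, assuming the 2-Gbps decoded link rate and the five-crossbars-per-plane figure from the text; the function names are hypothetical.

```python
def cell_time_ns(cell_bytes, link_gbps=2.0):
    """Time to transport one cell on one link (bits / Gbit/s = ns)."""
    return cell_bytes * 8 / link_gbps


def arbitration_schedule(cell_bytes=80, crossbars_per_plane=5):
    """(decision time in ns, crossbar index) pairs within one cell time.

    The arbiter visits its crossbars one at a time, so each crossbar
    receives a new configuration exactly once per cell time.
    """
    step = cell_time_ns(cell_bytes) / crossbars_per_plane
    return [(i * step, i) for i in range(crossbars_per_plane)]


print(arbitration_schedule())
# Consistency check on the plane count: 80 crossbars at 5 per physical plane
print(80 // 5, 2 * 8)   # physical planes needed vs. 2 logical x 8 physical
```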

Bid-match-accept arbitration and VOQ/arbiter communication
protocol. Each instance of the arbiter [...]
maximal matching algorithm called
[...] introduced by McKeown. However,
[...] distinctively differs from McKeown's [...]
in that the arbitration decisions apply to
packets, not just cells, and that multiple
instances of the algorithm [...] multiple
switching planes (one for each plane).
Furthermore, if each port is configured with
four channels, up to four packets can contend
for each output port from each input port at
any given time (one packet for each channel).
The Cyclone switch uses a hierarchical arbitration
algorithm to funnel the number of packets
to at most one for each output port from each
input port, as shown in Figure 14.

Two communication links are used between
each input port and an arbiter, one link in each
direction. The link is the same type used for
data, but it's separate from data links: The
control links are sideband. Note that arbitration
decisions are made on a per-packet basis,
not a per-cell basis. Once a decision is made
on a link, the link is locked for the duration of
the packet transmission on the link. The
communication protocol follows:



1. VOQs from every input port make bids
(and unlock the output links on which
the packet transmission has been
completed).

2. The arbiter computes a match based on
the maximal matching algorithm.

3. The arbiter accepts a bid for every uti- 
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Figure 15. Backplane redundancy. 
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Mido 



id the first bid is accepted, the sc 
ided. 



In Cyclone switches, each packet is carried entirely
on one link (not striped across multiple
links). Thus a hard failure on a link simply
shuts down the link. Because the switch uses



a large number of links (64 or 80) from each
input port to the backplane, a link failure causes
a very small degradation in performance.

Furthermore, the backplane speedup of two
can be viewed as 1 + 1 backplane redundancy,
as long as every arbitration channel is split
into two cards, as depicted in Figure 15.

Unlike conventional switches, Cyclone
switches operate both primary and redundant
channels. A failure of one side causes some
degradation in performance (latency increase),
not a catastrophic shutdown.



locked output port and locks the corresponding
output link for the duration of
packet transmission across the backplane.

A bid/match/accept cycle requires multistage
pipeline processing, which means that
new bids may be made for the next link
before receiving the acceptance message for
the prior bids. If back-to-back bids are made [...]

The memory subsystem [...]
that the egress memory [...]
accommodate two writes [...]
8 ns, instead of one write [...]
on the ingress side. Cyclone switches with four
channels per port use two memory banks per
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port on the egress side. The egress side sup- 
ports one output queue per channel. 

Output scheduling. The Cyclone architecture supports
two types of scheduling algorithms on
the egress side: DRR/WRR for coarse-grain
scheduling and weighted fair queuing
(WFQ) for fine-grain scheduling.

If DRR/WRR is selected, the scheduler
selects a packet for transmission from one of
the eight possible COS queues that comprise
a queue. (The DRR/WRR algorithm used on
the egress side is the same as the one on the
ingress side.)

If WFQ is selected, the WFQ scheduler
assigns a service deadline for each packet based
on the flow to which it belongs, and then sorts
the packets by the deadline and selects the
packet with the earliest deadline for transmission.
In the Cyclone architecture, the flow
is defined as a class of service supported on
each low-speed output connection, for example,
OC-3 or DS-3. Thirty-two thousand
flows are supported per port (or 1 million
flows total in 32-port-by-32-port configurations).
The earliest-deadline-first queue
(EDFQ) then sorts the packets by the deadline
and selects the packet with the earliest
deadline for transmission.

As shown in Figure 16, the Cyclone architecture
uses a type of WFQ called self-clocked
fair queuing (SCFQ). The deadlines for the
packets destined for output channel x are
computed as
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/•;. is the dt adline of the list packet that has left 
html is h f. t M sk; >r d P 

is the bandwidth allocated to flow i.
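
Because the deadline formula itself is not legible in this copy, the sketch below uses the commonly published SCFQ rule as a stand-in: a packet's deadline is the larger of its flow's previous deadline and the deadline of the packet that most recently left, plus the packet length divided by the flow's allocated bandwidth. Whether this matches the article's exact expression term for term is an assumption; the class `ScfqTagger` and its method names are hypothetical.

```python
class ScfqTagger:
    """Assigns self-clocked fair queuing deadlines (standard SCFQ rule, assumed)."""

    def __init__(self, rates):
        self.rates = dict(rates)                      # flow -> allocated bandwidth
        self.last_deadline = {f: 0.0 for f in self.rates}
        self.service_tag = 0.0                        # deadline of the last departed packet

    def assign(self, flow, length_cells):
        """Compute and record the deadline for a newly arrived packet."""
        start = max(self.last_deadline[flow], self.service_tag)
        deadline = start + length_cells / self.rates[flow]
        self.last_deadline[flow] = deadline
        return deadline

    def departed(self, deadline):
        """Advance the system tag when a packet leaves the output channel."""
        self.service_tag = deadline


tagger = ScfqTagger({"a": 2.0, "b": 1.0})
print(tagger.assign("a", 4), tagger.assign("a", 4), tagger.assign("b", 4))
```

The scheme is called self-clocked because the reference time is taken from the packet in service rather than from a separately simulated fluid model.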



order 



: the Jcadm 



nh,gu 



ed v 



s the 



following approach: No two deadlines of the
packets stored in the queue (as well as the [...]
The difference in deadlines of two consecutive
packets in the highest speed/highest priority
flow (for example, OC-192c at COS level 7) is
nominally the length of the first packet (in the


Figure 16. Self-clocked fair queuing.



number of cells). Thus the difference in deadlines
of two consecutive packets in the lowest
speed/lowest priority flow is N × M × the
length of the first packet. Here, N is the ratio
of rates of the highest and the lowest speed
connections allowed, and M is the ratio of
bandwidths allocated for the highest and the lowest
priority flows with the same speed. Hence, the
maximum difference in the deadlines would be
L × N × M × Q. Here, L is the maximal transfer
unit in cells, and Q is the number of queue
entries. If L = 256, N = 256, M = 32, and Q
= 1,024, the dynamic range of the deadline is
2^31. The Cyclone WFQ scheduler uses 32-bit
deadlines, which constrains the deadline
dynamic range to 2 billion time units.
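
The dynamic-range figure can be verified directly. The sketch below also includes a wraparound-safe comparison of 32-bit deadlines; that comparison is a common technique offered here as an assumption about how a bounded range can be exploited, not something the article states. All names are hypothetical.

```python
DEADLINE_BITS = 32


def deadline_dynamic_range(l_cells, n_rate_ratio, m_weight_ratio, q_entries):
    """Maximum difference between deadlines held in the queue: L x N x M x Q."""
    return l_cells * n_rate_ratio * m_weight_ratio * q_entries


def is_earlier(a, b, bits=DEADLINE_BITS):
    """True if deadline a precedes deadline b, tolerating counter wraparound.

    Valid only while the two deadlines are less than 2**(bits - 1) apart
    (assumed technique, see lead-in).
    """
    mask = (1 << bits) - 1
    diff = (b - a) & mask
    return 0 < diff < (1 << (bits - 1))


print(deadline_dynamic_range(256, 256, 32, 1024) == 2 ** 31)
print(is_earlier(0xFFFFFFF0, 5))   # 5 is "after" a value just before wraparound
```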

EDFQ. This queue is a pipelined chain of
discrete, locally interconnected stages (a systolic
array). It sorts packet pointers by their
deadlines so that the packet with the earliest
deadline is selected for transmission every 8
ns. (The FIFO ordering is maintained for
packets with the same deadline.)

The EDFQ receives two packet pointers 
and selects one packet for 
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lies? deadline in a constant (and short) 
amount of time, the EDFQ always maintains 
the packet with the earliest deadline at the 
head of the queue. 
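
The behavior of the EDFQ (earliest deadline at the head, FIFO order among equal deadlines) can be modeled in software with a binary heap. This is a functional stand-in only: the hardware is a systolic array with constant-time operation, whereas a heap takes logarithmic time per operation. The class and method names are hypothetical.

```python
import heapq
import itertools


class EarliestDeadlineFirstQueue:
    """Software model of an EDF queue with FIFO tie-breaking."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # breaks ties in arrival order

    def push(self, deadline, pointer):
        """Insert a packet pointer tagged with its deadline."""
        heapq.heappush(self._heap, (deadline, next(self._arrival), pointer))

    def pop(self):
        """Remove and return the pointer with the earliest deadline."""
        _deadline, _seq, pointer = heapq.heappop(self._heap)
        return pointer

    def __len__(self):
        return len(self._heap)


q = EarliestDeadlineFirstQueue()
for deadline, ptr in [(30, "p1"), (10, "p2"), (10, "p3"), (20, "p4")]:
    q.push(deadline, ptr)
print([q.pop() for _ in range(len(q))])
```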

As in the case of packet transmission
on input links, each packet in its entirety
is carried on a link. The placement of cells on
two adjacent links is staggered by 8 ns, because
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the EDFQcan only transmit one packet every 
8 ns. See Figure 17. In this example, four links
are reserved for TDM and the rest for packets.

The Cyclone switch architecture supports
flow control at the COS-level granularity.
When a COS level is oversubscribed at an
egress queue, the corresponding COS queues
at the ingress side (for the output channel) are
backpressured. This backpressure message
must be broadcast to all of the corresponding
COS queues on the ingress side, so the
[...]
this message. Likewise, when a COS queue at
the ingress side becomes nearly full, the external
line card that feeds the COS queue is
backpressured. This backpressuring can be
localized to a COS level for a particular output
destination channel.

Furthermore, the Cyclone switch architecture
supports the separate flow control of each
input channel. When packets from an input
channel are oversubscribed at an output
[...], the [...]
queue at the input channel is backpressured.

Simulation results 

To quantify the performance of a typical
configuration of the Cyclone switch architecture,
a cycle-accurate behavioral simulator of a
reference system was used with the following
parameters:

• 32 full-duplex ports,
• 4 channels per port,
• 16-Gbps channel bandwidth,
• 64-byte cell size, 8-byte cell overhead,
  and
• a backplane speedup of 2.

Two types of packet length distribution were
used: geometric distribution with a mean of 32
cells and MCI Internet distribution. The
distribution over input and output channels was
uniform. The offered load varied from 40% to
100%. A 70% load corresponds to the full OC-192
payload for both types of packet length
distributions; 100% corresponds to approximately
1.4 times the OC-192 payload.

Figure 18 shows a latency comparison of
the Cyclone switch configured as just
described to a baseline switch configured as
follows:

• 32 full-duplex ports,
• 1 channel per port, 10-Gbps channel
  bandwidth, and
• no backplane speedup (input buffering
  only).
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Figure 17 Output timing: cell size 
= 80 bytes; link speed = 2 Gbps; there are 10 links per channel
and 4 channels per port.



The latency is measured in
number of 8-ns clock cycles.
As expected, the latency of
the baseline switch increases
exponentially as the offered
load approaches 70% (100%
of its channel bandwidth).
However, the latency of the
Cyclone switch remains flat
up to a 90% offered load and
increases to approximately
5,000 clock cycles at 100%.

Figure 19 shows peak
buffer and queue occupancy
measurement results for the
MCI Internet distribution.
Again, the distribution across
input and output channels is
uniform. Both the queue and
buffer occupancies rise moderately
exponentially as the
offered load increases beyond
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Figure 1 8. Latency versus offered load with 64-byie cells and 8-byte overhead. Baselin. 
speedup; 32 10-Gbps channels. Cyclone: 1.6× speedup; 128 16-Gbps channels.
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tion; packet destination = uniform across 128 channels. 



the full OC-192 load. Clearly, the ingress
queue/buffer occupancy is one order of magnitude
lower than the egress queue/buffer
occupancy, which is expected with the significant
backplane speedup.

The Cyclone switch architecture has several
key attributes.

1. It's a true terabit switching platform. It
can terminate and switch more than 128
OC-192c.

2. It supports multiple sophisticated scheduling
algorithms, such as DRR, WRR,
and WFQ, on the ingress side as well as
on the egress side, which is necessary to
assure end-to-end quality of service.

3. It's designed to support the TDM service
naturally: for TDM traffic, the switch
guarantees reserved bandwidth and no
delay jitter.

4. It [...] handles packet
traffic and cell traffic equally well.

5. It's scalable from 40 Gbps to 2.5 Tbps
without design changes.

6. It features robust fault tolerance and
redundancy both at the link level as well
as at the board level.

The Cyclone switch architecture has a
potential to change the way future multiservice
switches are designed because of its scalability,
flexibility, and ability to handle
multiple protocols equally well.
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